Does topography have an effect on the likelihood of car break-ins? This regression analysis seeks to provide insights into this question using data from San Francisco, which has ideal geography for this context.

That is one steep street!
This analysis will follow the specific steps outlined in Table 1 below:
library(dplyr)       # provides the %>% pipe
library(knitr)       # kable()
library(kableExtra)  # kable_paper(), kable_styling(), column_spec(), row_spec()

# Build and style the analysis plan table
data.frame("Step" = c(1:8),
           "Description" = c("Identify research question",
                             "Select key variables",
                             "Collect data",
                             "Describe data",
                             "Visualize relationships",
                             "Test OLS assumptions",
                             "Conduct regression analysis",
                             "Interpret results")) %>%
  kable(caption = "Analysis Plan") %>%
  kable_paper(full_width = FALSE) %>%
  kable_styling(latex_options = "striped",
                font_size = 15) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "black")
| Step | Description |
|---|---|
| 1 | Identify research question |
| 2 | Select key variables |
| 3 | Collect data |
| 4 | Describe data |
| 5 | Visualize relationships |
| 6 | Test OLS assumptions |
| 7 | Conduct regression analysis |
| 8 | Interpret results |
My research question is: What is the relationship between topography and car break-ins in San Francisco?
Terrain and motor vehicle crime both come up constantly in conversations about living in or visiting San Francisco. In April 2021, Young-An Kim and James C. Wo published "Topography and crime in place: The effects of elevation, slope, and betweenness in San Francisco street segments." Their study supports the idea that “hilliness” has an effect on crime after taking socio-economic characteristics into consideration (Kim and Wo 2021). However, their analysis did not separate crime into specific categories; it pooled all types, including violent, nonviolent, and property crime.
My analysis will focus only on car break-ins rather than all crime reports, as I believe that these crimes will have a more significant relationship with topography. In addition, my analysis will use more recent crime report and socio-economic data sets. Understanding the relationship between topography and car break-ins can influence local-level policy decisions and direct limited resources dedicated to crime prevention. For example, the city could focus enforcement and policy on areas within certain elevation and slope ranges that are identified in this analysis as being particularly susceptible to car break-ins.
Table 2 below contains the key measures of interest in our analysis:
# Build and style the key variables table
data.frame("Dependent Variable" = c("Number of Car Break-Ins", ""),
           "Independent Variables" = c("Elevation", "Slope"),
           "Control Variable" = c("Median Income", "")) %>%
  kable(col.names = c("Dependent Variable", "Independent Variables", "Control Variable"),
        caption = "Key variables for regression analysis") %>%
  kable_paper(full_width = FALSE) %>%
  kable_styling(latex_options = "striped",
                font_size = 15) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "black")
| Dependent Variable | Independent Variables | Control Variable |
|---|---|---|
| Number of Car Break-Ins | Elevation | Median Income |
| | Slope | |
Both elevation and slope are included because together they capture the effect of local topography: elevation measures height, while slope measures steepness. Because socio-economic conditions can confound an econometric analysis like this one, median income is included as a control variable.
Here is the regression equation:
\[NumBreakIns_i = \beta_0 + \beta_1Elevation_i + \beta_2Slope_i + \beta_3MedianIncome_i + u_i\]
Based on the existing literature by Kim & Wo, a negative correlation is expected between the number of car break-ins and all three explanatory variables (elevation, slope, and median income).
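To make the mapping from the regression equation to R concrete, here is a minimal sketch that fits the same functional form with `lm()`. The data frame `toy` and every coefficient used to simulate it are hypothetical and exist only for illustration; they are not the San Francisco data or results from this analysis.

```r
# Hypothetical simulated data; none of these values come from the real data sets
set.seed(1)
n <- 200
toy <- data.frame(
  elev          = runif(n, 0, 780),         # elevation in feet
  slope         = runif(n, 0, 40),          # slope in percent
  median_income = runif(n, 12000, 208000)   # median income in USD
)

# Simulate a response with negative effects for all three regressors plus noise
toy$num_break_ins <- 80 - 0.03 * toy$elev - 0.5 * toy$slope -
  0.00005 * toy$median_income + rnorm(n, sd = 2)

# Same form as the regression equation: intercept plus three slopes and an error term
fit <- lm(num_break_ins ~ elev + slope + median_income, data = toy)
coef(fit)
```

Each estimated coefficient corresponds to one of the β terms in the equation. In the simulated data all three slopes are negative by construction, which is the sign pattern the hypothesis anticipates.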
tidycensus package via the US Census Bureau.
Since our variables interact over space, we merge the crime, topography, and median income data sets with spatial joins. All data sets are loaded into R as shapefiles in the Coordinate Reference System (CRS) EPSG:7132 (NAD83(2011) / San Francisco CS13 (ftUS)), a projected CRS for the city and county of San Francisco that is adequate for high-precision (0.03 ft) analysis.
library(sf)  # st_nearest_feature(), st_join(), st_within(), st_drop_geometry()

# Find index of nearest contour to each crime
nearest_idx <- st_nearest_feature(x = crimes, y = contours)

# Attach census median income and nearest-contour elevation to each crime
crimes <- crimes %>%
  st_join(y = census_geom, join = st_within, left = TRUE) %>%
  mutate(elev = contours$elevation[nearest_idx]) %>%
  rename(median_income = estimate) %>%
  # slope is already a column of crimes at this point
  select(date_incid, slope, median_income, elev, geometry)

# Group by all three variables and count break-ins per combination
crimes_summary <- crimes %>%
  st_drop_geometry() %>%
  group_by(slope, median_income, elev) %>%
  summarize(count = n(), .groups = "drop")
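The nearest-contour lookup above can be checked on a small example. The sketch below uses hypothetical toy geometries (`contours_toy` and `crimes_toy`, in plain planar units with no CRS) to show how `st_nearest_feature()` returns a row index that is then used to pull the elevation attribute.

```r
library(sf)

# Hypothetical toy data: two horizontal contour lines at y = 0 and y = 10
contours_toy <- st_sf(
  elevation = c(0, 100),
  geometry = st_sfc(
    st_linestring(rbind(c(0, 0), c(10, 0))),
    st_linestring(rbind(c(0, 10), c(10, 10)))
  )
)

# Three hypothetical incident locations
crimes_toy <- st_sf(
  id = 1:3,
  geometry = st_sfc(st_point(c(1, 1)), st_point(c(5, 9)), st_point(c(9, 4)))
)

# Row index of the closest contour for each point
idx <- st_nearest_feature(crimes_toy, contours_toy)

# Attach the elevation of that contour to each point
crimes_toy$elev <- contours_toy$elevation[idx]
crimes_toy$elev
```

The first and third points lie closer to the lower line and the second lies closer to the upper line, so each receives the elevation of its own nearest contour. This also shows how a point just offshore can inherit a negative contour value.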
The summary statistics for the data are provided in Table 3 below.
# Summary statistics for the three explanatory variables
crimes %>%
  st_drop_geometry() %>%
  select(slope, elev, median_income) %>%
  psych::describe(fast = TRUE) %>%
  kable(col.names = c("", "Count", "Mean", "SD", "Min", "Max", "Range", "SE"),
        caption = "Summary statistics for Slope, Elevation, and Median Income variables") %>%
  kable_paper(full_width = FALSE) %>%
  kable_styling(latex_options = "striped",
                font_size = 15) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "black")
| | | Count | Mean | SD | Min | Max | Range | SE |
|---|---|---|---|---|---|---|---|---|
| slope | 1 | 24454 | 4.492966 | 4.793824 | 0 | 68 | 68 | 0.0306554 |
| elev | 2 | 24454 | 129.321379 | 120.439659 | -5 | 780 | 785 | 0.7701841 |
| median_income | 3 | 23764 | 114039.269567 | 47501.416992 | 12340 | 208425 | 196085 | 308.1390882 |
The count column shows 24,454 total break-ins in our data, of which 690 are missing a median income value. Slope ranges from 0 to 68 percent (~34 degrees) with a mean of approximately 4.5 percent, which indicates that far more break-ins occur on flatter streets, as expected. Elevation ranges from -5 to 780 feet with a mean of approximately 129 feet (note: negative values result from attaching the nearest contour level to crimes very close to the coastline). Again, more break-ins occur at lower elevations than at higher ones. Median income ranges from ~$12,000 to ~$208,000 with a mean of approximately $114,000. Because this mean sits close to the midpoint of the range, the median income values appear to be fairly evenly distributed.
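As a quick check of the figures quoted above, a grade given in percent converts to an angle through the arctangent, and the number of records without a median income value follows from the two counts in Table 3:

\[\theta_{max} = \arctan\left(\frac{68}{100}\right) \approx 34.2^\circ, \qquad \theta_{mean} = \arctan\left(\frac{4.49}{100}\right) \approx 2.6^\circ\]

\[24454 - 23764 = 690\]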
The box plots below visualize these observations:
library(ggplot2)    # plotting
library(patchwork)  # combining plots with + and plot_annotation()

# Box plot of slope with jittered observations
slope_box <- ggplot(data = crimes, aes(x = "", y = slope)) +
  geom_boxplot(color = "orange") +
  geom_jitter(aes(color = slope),
              width = 0.2,
              size = 0.4,
              alpha = 0.025,
              show.legend = FALSE) +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank()) +
  labs(x = "Slope (percent)")

# Box plot of elevation with jittered observations
elev_box <- ggplot(data = crimes, aes(x = "", y = elev)) +
  geom_boxplot(color = "darkgreen") +
  geom_jitter(aes(color = elev),
              width = 0.2,
              size = 0.4,
              alpha = 0.025,
              show.legend = FALSE) +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank()) +
  labs(x = "Elevation (feet)")

# Box plot of median income with jittered observations
income_box <- ggplot(data = crimes, aes(x = "", y = median_income)) +
  geom_boxplot(color = "blue") +
  geom_jitter(aes(color = median_income),
              width = 0.2,
              size = 0.4,
              alpha = 0.025,
              show.legend = FALSE) +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank()) +
  labs(x = "Median Income (USD)")

# Arrange the three box plots side by side under one title
elev_box + slope_box + income_box +
  plot_annotation(title = 'Figure 1. Boxplots of Independent Variables',
                  theme = theme(plot.title = element_text(hjust = 0.5)))

The box plots show more clearly that elevation and slope are both concentrated at low values, with high outliers. The extreme elevation values appear plausible: the highest peak in San Francisco is Mount Davidson at 938 feet, well above the 780-foot maximum in the data. The steepest surveyed road, however, is Bradford Street at 41% grade. Since the slope data reaches 68%, this indicates that two of the slope outliers are likely due to data artifacts.
The following maps visualize the four variables:

Visualizing the spatial distribution of crime data in relation to the distributions of the three independent variables (elevation, slope, and median income).
The plots in Figure 2 show the simple relationships between the count of car break-ins and each of the three independent variables (elevation, slope, and median income).
# Group by income
income_summary <- crimes %>%
  st_drop_geometry() %>%
  group_by(median_income) %>%
  summarize(count = n())

income_plot <- ggplot(data = income_summary, aes(x = median_income, y = count)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  labs(title = "Crime and Median Income",
       x = "Median Income (USD)",
       y = "Number of Break-Ins") +
  theme(title = element_text(size = 10))

# Group by elevation
elev_summary <- crimes %>%
  st_drop_geometry() %>%
  group_by(elev) %>%
  summarize(count = n())

elev_plot <- ggplot(data = elev_summary, aes(x = elev, y = count)) +
  geom_point(alpha = 0.5, color = "orange") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  labs(title = "Crime and Elevation",
       x = "Elevation (feet)",
       y = "Number of Break-Ins") +
  theme(title = element_text(size = 10))

# Group by slope
slope_summary <- crimes %>%
  st_drop_geometry() %>%
  group_by(slope) %>%
  summarize(count = n())

slope_plot <- ggplot(data = slope_summary, aes(x = slope, y = count)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  labs(title = "Crime and Slope",
       x = "Slope (percent)",
       y = "Number of Break-Ins") +
  theme(title = element_text(size = 10))

# Elevation on the left, slope stacked over income on the right
elev_plot + (slope_plot / income_plot) +
  plot_annotation(title = 'Figure 2. Simple relationships',
                  theme = theme(plot.title = element_text(hjust = 0.5)))

Elevation and crime are negatively correlated. The relationship does not appear to be linear, so the model may fit better after a transformation. Slope and crime are also negatively correlated and likewise show signs of a non-linear relationship, which further supports a transformation. By contrast, median income shows only a weak negative correlation with car break-ins, and possibly no significant relationship at all.
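One way to explore such a transformation is to compare a model of the raw counts with a model of the log counts. The sketch below is an assumption-laden illustration: the data frame `sim` is simulated with an exponential decay in elevation, and the log transform is only one candidate; it is not necessarily the transformation used later in this analysis.

```r
# Hypothetical simulated counts that decay exponentially with elevation
set.seed(42)
sim <- data.frame(elev = seq(0, 700, by = 10))
sim$count <- round(400 * exp(-0.006 * sim$elev) *
                     exp(rnorm(nrow(sim), sd = 0.2))) + 1

# Untransformed model: count as a linear function of elevation
linear_fit <- lm(count ~ elev, data = sim)

# Log-transformed response: linear in elevation on the log scale
log_fit <- lm(log(count) ~ elev, data = sim)

r2_linear <- summary(linear_fit)$r.squared
r2_log <- summary(log_fit)$r.squared
c(linear = r2_linear, log = r2_log)
```

Because the two models have responses on different scales, their R-squared values are only a rough guide. Residual plots for each fit give a more reliable picture of whether the transformation actually helps.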
Lastly, the figure below visualizes the spatial distribution of the crime data within the city and provides a relative idea of the areas of higher elevation. Crime data is distributed across most of the study area. Additionally, regions with significant terrain (such as Twin Peaks, Potrero Hill, and Nob Hill) can be identified by the areas with concentrations of darker green points.